Lecture 2: Stochastic Multi-armed bandit (IID model)

Author

  • Antoine Désir
Abstract

1.1 Reward model

The next step in defining the model is to describe how the rewards are generated. This is where the stochastic assumption and the IID model come in. In particular, we assume that the reward from each arm $i$ follows a distribution $\nu_i$ with mean $\mu_i$. When pulling arm $i$, the reward is generated independently from the distribution $\nu_i$. More precisely, let $H_{t-1}$ denote the history up to and including time $t-1$; we can write $H_{t-1} = \{(I_1, r_1), \ldots, (I_{t-1}, r_{t-1})\}$. Then our assumption on the rewards can be written as $r_t \mid (H_{t-1}, I_t = i) \sim \nu_i$, which also implies $\mathbb{E}[r_t \mid H_{t-1}, I_t = i] = \mu_i$. In other words, given the history up to time $t-1$ and the chosen arm $I_t$, the reward is drawn independently from the distribution of the chosen arm.
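As a concrete illustration of this reward model, here is a minimal simulation sketch (not part of the lecture; the Bernoulli distributions, the `BernoulliArm` class, and the chosen means are assumptions made purely for illustration). Each arm owns a fixed distribution $\nu_i$, and every pull returns a fresh independent sample from it, regardless of the history.

```python
import random

class BernoulliArm:
    """Illustrative arm whose reward distribution nu_i is Bernoulli(mu_i)."""
    def __init__(self, mu):
        self.mu = mu  # mean reward mu_i of this arm

    def pull(self):
        # Each pull is an independent draw from nu_i; the history H_{t-1}
        # has no influence on the realized reward.
        return 1.0 if random.random() < self.mu else 0.0

# Hypothetical bandit instance with three arms of different means.
arms = [BernoulliArm(0.3), BernoulliArm(0.5), BernoulliArm(0.7)]

history = []  # H_t = [(I_1, r_1), ..., (I_t, r_t)]
for t in range(10):
    i = random.randrange(len(arms))  # any selection rule can be plugged in here
    r = arms[i].pull()               # r_t ~ nu_i, independent given the chosen arm
    history.append((i, r))
print(history)
```

Any selection rule (greedy, UCB, Thompson Sampling, and so on) can replace the random choice of `i` above without changing how the rewards themselves are generated.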


Similar resources

Multi-armed Bandit Problems with History

In a multi-armed bandit problem, at each time step, an algorithm chooses one of the possible arms and observes its reward. The goal is to maximize the sum of rewards over all time steps (or to minimize the regret). In the conventional formulation of the problem, the algorithm has no prior knowledge about the arms. Many applications, however, provide some data about the arms even before the alg...
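For reference, the regret mentioned here (a standard definition, not quoted from the abstract above) compares the algorithm to always playing the best arm. In the notation of the lecture, with $\mu^* = \max_i \mu_i$ and $I_t$ the arm chosen at time $t$, over a horizon of $T$ steps:

$R_T = T\,\mu^* - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{I_t}\Big].$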


Analysis of Thompson Sampling for the Multi-armed Bandit Problem

The multi-armed bandit problem is a popular model for studying exploration/exploitation trade-off in sequential decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic idea is to choose an arm to pla...
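As a rough sketch of that idea (a standard Beta-Bernoulli variant, shown here as an illustration rather than the exact procedure analyzed in the cited paper), each arm's mean is given a Beta posterior, one sample is drawn from every posterior, and the arm with the largest sample is played:

```python
import random

def thompson_sampling(arms, horizon):
    """Beta-Bernoulli Thompson Sampling sketch. `arms` must expose a .pull()
    method returning rewards in {0, 1} (e.g. the illustrative BernoulliArm above)."""
    successes = [0] * len(arms)  # posterior of arm i is Beta(successes[i]+1, failures[i]+1)
    failures = [0] * len(arms)
    total = 0.0
    for _ in range(horizon):
        # Draw one sample from each arm's posterior and play the largest.
        samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(len(arms))]
        i = max(range(len(arms)), key=lambda k: samples[k])
        r = arms[i].pull()
        total += r
        if r > 0.5:
            successes[i] += 1
        else:
            failures[i] += 1
    return total
```

For example, `thompson_sampling([BernoulliArm(0.3), BernoulliArm(0.7)], 1000)` reuses the illustrative arms defined earlier; the randomness of the posterior samples is what drives exploration.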


Lecture 18: Stochastic Bandits

Last time we talked about the nonstochastic bandit problem, which was a partial-information version of our online learning problem. Here we studied situations where at each iteration $t$, the learner chooses an action $a_t$ and suffers loss $\ell_t(a_t)$, which is the only thing the learner observes. We showed that the importance weighting trick can be plugged into any full-information algorithm with a loca...
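The importance weighting trick referred to here is, in its usual form (stated for context; the exact presentation in that lecture may differ): if the action $a_t$ is drawn with probability $p_t(a_t)$ and only $\ell_t(a_t)$ is observed, then the estimate

$\hat{\ell}_t(a) = \dfrac{\ell_t(a_t)}{p_t(a_t)}\,\mathbf{1}\{a = a_t\}$

satisfies $\mathbb{E}_{a_t \sim p_t}[\hat{\ell}_t(a)] = \ell_t(a)$ for every action $a$, so it can be fed to a full-information algorithm in place of the unobserved losses.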


Gap-free Bounds for Stochastic Multi-Armed Bandit

We consider the stochastic multi-armed bandit problem with unknown horizon. We present a randomized decision strategy which is based on updating a probability distribution through a stochastic mirror descent type algorithm. We consider separately two assumptions: nonnegative losses or arbitrary losses with an exponential moment condition. We prove optimal (up to logarithmic factors) gap-free bo...


Asymptotically Optimal Multi-Armed Bandit Policies under a Cost Constraint

We develop asymptotically optimal policies for the multi-armed bandit (MAB) problem under a cost constraint. This model is applicable in situations where each sample (or activation) from a population (bandit) incurs a known bandit-dependent cost. Successive samples from each population are IID random variables with unknown distribution. The objective is to have a feasible policy for deciding ...


Publication date: 2016